17 research outputs found
Multi-Constraint Molecular Generation using Sparsely Labelled Training Data for Localized High-Concentration Electrolyte Diluent Screening
Recently, machine learning methods have been used to propose molecules with
desired properties, which is especially useful for exploring large chemical
spaces efficiently. However, these methods rely on fully labelled training
data, and are not practical in situations where molecules with multiple
property constraints are required. There is often insufficient training data
for all those properties from publicly available databases, especially when
ab-initio simulation or experimental property data is also desired for training
the conditional molecular generative model. In this work, we show how to modify
a semi-supervised variational auto-encoder (SSVAE) model which only works with
fully labelled and fully unlabelled molecular property training data into the
ConGen model, which also works on training data that have sparsely populated
labels. We evaluate ConGen's performance in generating molecules with multiple
constraints when trained on a dataset combined from multiple publicly available
molecule property databases, and demonstrate an example application of building
the virtual chemical space for potential Lithium-ion battery localized
high-concentration electrolyte (LHCE) diluents.
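One generic way to train on sparsely populated labels is to leave missing entries out of the property loss. The sketch below illustrates that general idea only; it is not the ConGen implementation, which the abstract does not describe in detail. The function name `masked_mse` and the use of `None` to mark a missing label are assumptions made for illustration.

```python
from typing import Optional, Sequence


def masked_mse(
    predictions: Sequence[Sequence[float]],
    labels: Sequence[Sequence[Optional[float]]],
) -> float:
    """Mean squared error over the labels that are present (None = missing)."""
    total = 0.0
    count = 0
    for pred_row, label_row in zip(predictions, labels):
        for pred, label in zip(pred_row, label_row):
            if label is None:
                continue  # missing property: contributes nothing to the loss
            total += (pred - label) ** 2
            count += 1
    # A batch with no labels at all yields a supervised loss of zero.
    return total / count if count else 0.0


# Two molecules, two properties each; every molecule has one label missing.
preds = [[1.0, 2.0], [3.0, 4.0]]
labels = [[1.0, None], [None, 2.0]]
loss = masked_mse(preds, labels)
```

Only the two observed labels enter the average here, so a molecule with a single known property still contributes a training signal for that property.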
The Lifecycle and Cascade of WeChat Social Messaging Groups
Social instant messaging services are emerging as a transformative way for
people to connect and communicate with friends in their daily lives: they
catalyze the formation of social groups and bring people a stronger sense
of community and connection. However, the research community still knows little
about the formation and evolution of groups in the context of social messaging
- their lifecycles, the change in their underlying structures over time, and
the diffusion processes by which they develop new members. In this paper, we
analyze the daily usage logs from the WeChat group messaging platform - the largest
standalone messaging communication service in China - with the goal of
understanding the processes by which social messaging groups come together,
grow new members, and evolve over time. Specifically, we discover a strong
dichotomy among groups in terms of their lifecycle, and develop a separability
model by taking into account a broad range of group-level features, showing
that long-term and short-term groups are inherently distinct. We also find
that the lifecycle of messaging groups is largely dependent on their social
roles and functions in users' daily social experiences and specific purposes.
Given the strong separability between the long-term and short-term groups, we
further address the problem concerning the early prediction of successful
communities. In addition to modeling the growth and evolution from a group-level
perspective, we investigate the individual-level attributes of group members
and study the diffusion process by which groups gain new members. By
considering members' historical engagement behavior as well as the local social
network structure in which they are embedded, we develop a membership cascade
model and demonstrate its effectiveness by achieving an AUC of 95.31% in
predicting the inviter and an AUC of 98.66% in predicting the invitee.
Comment: 10 pages, 8 figures, to appear in proceedings of the 25th
International World Wide Web Conference (WWW 2016)
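For readers unfamiliar with the AUC figures quoted above, AUC can be read as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. The sketch below computes it directly from that definition; the function name `roc_auc` and the toy scores are illustrative assumptions and have no connection to the WeChat data.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # positive correctly ranked above negative
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ordered correctly.
example_auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration; library implementations typically use a sort-based formulation instead.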
Protein-Ligand Complex Generator & Drug Screening via Tiered Tensor Transform
The generation of binding poses for a small-molecule candidate (ligand) in its
target protein pocket is important for computer-aided drug discovery. Typical
rigid-body docking methods ignore the flexibility of the protein pocket, while the
more accurate pose generation using molecular dynamics is hindered by slow
protein dynamics. We develop a tiered tensor transform (3T) algorithm to
rapidly generate diverse protein-ligand complex conformations for both pose and
affinity estimation in drug screening, requiring neither machine learning
training nor lengthy dynamics computation, while maintaining both
coarse-grain-like coordinated protein dynamics and atomistic-level details of
the complex pocket. The 3T conformation structures we generate achieve
significantly higher accuracy in active ligand classification than traditional
ensemble docking using hundreds of experimental protein conformations.
Furthermore, we demonstrate that 3T can be used to explore distant
protein-ligand binding poses within the protein pocket. 3T structure
transformation is decoupled from the system physics, making future usage in
other computational scientific domains possible.
Towards Lightweight and Automated Representation Learning System for Networks
We propose LIGHTNE 2.0, a cost-effective, scalable, automated, and
high-quality network embedding system that scales to graphs with hundreds of
billions of edges on a single machine. In contrast to the mainstream belief
that distributed architecture and GPUs are needed for large-scale network
embedding with good quality, we prove that we can achieve higher quality,
better scalability, lower cost, and faster runtime with shared-memory, CPU-only
architecture. LIGHTNE 2.0 combines two theoretically grounded embedding methods,
NetSMF and ProNE. We introduce the following techniques to network embedding
for the first time: (1) a newly proposed downsampling method to reduce the
sample complexity of NetSMF while preserving its theoretical advantages; (2) a
high-performance parallel graph processing stack, GBBS, to achieve high memory
efficiency and scalability; (3) a sparse parallel hash table to aggregate and
maintain the matrix sparsifier in memory; (4) a fast randomized singular value
decomposition (SVD) enhanced by power iteration and fast orthonormalization to
improve vanilla randomized SVD in terms of both efficiency and effectiveness;
(5) Intel MKL for proposed fast randomized SVD and spectral propagation; and
(6) a fast and lightweight AutoML library, FLAML, for automated hyperparameter
tuning. Experimental results show that LIGHTNE 2.0 can be up to 84X faster than
GraphVite, 30X faster than PBG and 9X faster than NetSMF while delivering
better performance. LIGHTNE 2.0 can embed a very large graph with 1.7 billion
nodes and 124 billion edges in half an hour on a CPU server, while other
baselines cannot handle graphs of this scale.
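Technique (4) above, randomized SVD with power iteration, can be sketched in a few lines. This is a generic textbook-style version written with NumPy, not LIGHTNE 2.0's code: the abstract says the system uses Intel MKL and a fast orthonormalization step, whereas this sketch uses plain QR. The function name `randomized_svd` and the parameters `oversample`, `power_iters`, and `seed` are assumptions made for illustration.

```python
import numpy as np


def randomized_svd(matrix, rank, oversample=10, power_iters=2, seed=0):
    """Approximate the top-`rank` singular triplets of `matrix`."""
    rng = np.random.default_rng(seed)
    n_cols = matrix.shape[1]
    sketch_size = min(rank + oversample, n_cols)

    # Project onto a random subspace to capture the dominant column space.
    omega = rng.standard_normal((n_cols, sketch_size))
    basis, _ = np.linalg.qr(matrix @ omega)

    # Power iteration sharpens the basis toward the leading singular vectors;
    # re-orthonormalizing after each product keeps it numerically stable.
    for _ in range(power_iters):
        basis, _ = np.linalg.qr(matrix.T @ basis)
        basis, _ = np.linalg.qr(matrix @ basis)

    # Solve the small problem exactly, then map back to the original space.
    small = basis.T @ matrix
    u_small, singular_values, vt = np.linalg.svd(small, full_matrices=False)
    u = basis @ u_small
    return u[:, :rank], singular_values[:rank], vt[:rank]


# Usage on a synthetic matrix whose rank is exactly 5.
rng = np.random.default_rng(1)
A = rng.standard_normal((60, 5)) @ rng.standard_normal((5, 40))
u, s, vt = randomized_svd(A, rank=5)
reconstruction = (u * s) @ vt
```

Because the expensive exact SVD runs only on the small projected matrix, the cost is dominated by a handful of matrix products with the original matrix, which is what makes the approach attractive for large sparse inputs.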
Spatio-Temporal Contrastive Learning Enhanced GNNs for Session-based Recommendation
Session-based recommendation (SBR) systems aim to utilize the user's
short-term behavior sequence to predict the next item without the detailed user
profile. Most recent works try to model the user preference by treating the
sessions as between-item transition graphs and utilize various graph neural
networks (GNNs) to encode the representations of pair-wise relations among
items and their neighbors. Some existing GNN-based models focus mainly
on aggregating information from the spatial graph structure and therefore
ignore the temporal relations among an item's neighbors during message
passing; the resulting information loss leads to sub-optimal results. Other works
embrace this challenge by incorporating additional temporal information but
lack sufficient interaction between the spatial and temporal patterns. To
address this issue, inspired by the uniformity and alignment properties of
contrastive learning techniques, we propose a novel framework called
Session-based Recommendation with Spatio-Temporal Contrastive Learning Enhanced
GNNs (RESTC). The idea is to supplement the GNN-based main supervised
recommendation task with the temporal representation via an auxiliary
cross-view contrastive learning mechanism. Furthermore, a novel global
collaborative filtering graph (CFG) embedding is leveraged to enhance the
spatial view in the main task. Extensive experiments demonstrate that RESTC
significantly outperforms state-of-the-art baselines,
e.g., with gains of up to 27.08% on HR@20 and 20.10% on
MRR@20.
Comment: Draft under review at ACM TOI
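For context on the reported metric, HR@K is the fraction of sessions whose true next item appears in the top K recommendations. A reciprocal-rank metric, MRR@K, is commonly reported alongside it and averages 1/rank of the true item, counting a miss as zero. The sketch below shows both; the function names and the toy item IDs are illustrative assumptions, not taken from the RESTC evaluation code.

```python
from typing import Hashable, Sequence


def hit_rate_at_k(
    ranked_lists: Sequence[Sequence[Hashable]],
    targets: Sequence[Hashable],
    k: int = 20,
) -> float:
    """Fraction of sessions whose target item is within the top-k ranking."""
    hits = sum(1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k])
    return hits / len(targets)


def mrr_at_k(
    ranked_lists: Sequence[Sequence[Hashable]],
    targets: Sequence[Hashable],
    k: int = 20,
) -> float:
    """Mean reciprocal rank of the target item, with misses scored as zero."""
    total = 0.0
    for ranked, target in zip(ranked_lists, targets):
        top_k = list(ranked[:k])
        if target in top_k:
            total += 1.0 / (top_k.index(target) + 1)  # ranks are 1-based
    return total / len(targets)


# Four toy sessions whose true next item is always item 1.
ranked_lists = [[1, 2, 3], [2, 1, 3], [3, 2, 4], [5, 6, 7]]
targets = [1, 1, 1, 1]
hr = hit_rate_at_k(ranked_lists, targets, k=3)
mrr = mrr_at_k(ranked_lists, targets, k=3)
```

In this toy data the target is ranked first once, second once, and missed twice, so the hit rate rewards both hits equally while the reciprocal-rank score gives the second-place hit only half credit.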